Fade-in and fade-out temporal segments

ABSTRACT

A method for performing content-based temporal segmentation of video sequences, the method comprising the steps of: transmitting the video sequence to a processor; identifying within the video sequence a plurality of type-specific individual temporal segments using a plurality of type-specific detectors; analyzing and refining the plurality of type-specific individual temporal segments identified in the identifying step; and outputting a list of locations within the video sequence of the identified type-specific individual temporal segments.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This is a divisional of application Ser. No. 08/902,545, filed Jul. 29, 1997 by Warnick et al., entitled A METHOD FOR CONTENT-BASED TEMPORAL SEGMENTATION OF VIDEO.

FIELD OF THE INVENTION

[0002] The invention relates generally to the field of visual information management, and in particular to computer-implemented processing for content-based temporal segmentation of video sequences.

BACKGROUND OF THE INVENTION

[0003] Efficient representation of visual content of video streams has emerged as the primary functionality in distributed multimedia applications, including video-on-demand, interactive video, content-based search and manipulation, and automatic analysis of surveillance video. A video stream is a temporally evolving medium, where content changes occur due to camera shot changes, special effects, and object/camera motion within the video sequence. Temporal video segmentation constitutes the first step in content-based video analysis, and refers to breaking the input video sequence into multiple temporal units (segments) based upon certain uniformity criteria.

[0004] Automatic temporal segmentation of video sequences has previously centered around the detection of individual camera shots, where each shot contains the temporal sequence of frames generated during a single operation of the camera. Shot detection is performed by computing frame-to-frame similarity metrics to distinguish intershot variations, which are introduced by transitions from one camera shot to the next, from intrashot variations, which are introduced by object and/or camera movement as well as by changes in illumination. Such methods are collectively known as video shot boundary detection (SBD). Various SBD methods for temporal video segmentation have been developed. These methods can be broadly divided into three classes, each employing different frame-to-frame similarity metrics: (1) pixel/block comparison methods, (2) intensity/color histogram comparison methods, and (3) methods which operate only on compressed, i.e., MPEG encoded, video sequences (see K. R. Rao and J. J. Hwang, Techniques and Standards for Image, Video and Audio Coding, Chapters 10-12, Prentice-Hall, New Jersey, 1996).

[0005] The pixel-based comparison methods detect dissimilarities between two video frames by comparing the differences in intensity values of corresponding pixels in the two frames. The number of changed pixels is counted, and a camera shot boundary is declared if the percentage of the total number of pixels changed exceeds a certain threshold value (see H. J. Zhang, A. Kankanhalli and S. W. Smoliar, “Automatic partitioning of full-motion video,” ACM/Springer Multimedia Systems, Vol. 1(1), pp. 10-28, 1993). This type of method can produce numerous false shot boundaries due to slight camera movement, e.g., pan or zoom, and/or object movement. Additionally, the proper threshold value is a function of video content and, consequently, requires trial-and-error adjustment to achieve optimum performance for any given video sequence.

[0006] The use of intensity/color histograms for frame content comparison is more robust to noise and object/camera motion, since the histogram takes into account only global intensity/color characteristics of each frame. With this method, a shot boundary is detected if the dissimilarity between the histograms of two adjacent frames is greater than a pre-specified threshold value (see H. J. Zhang, A. Kankanhalli and S. W. Smoliar, “Automatic partitioning of full-motion video,” ACM/Springer Multimedia Systems, Vol. 1(1), pp. 10-28, 1993). As with the pixel-based comparison method, selecting a small threshold value will lead to false detections of shot boundaries due to object and/or camera motion within the video sequence. Additionally, if the adjacent shots have similar global color characteristics but different content, the histogram dissimilarity will be small and the shot boundary will go undetected.

[0007] Temporal segmentation methods have also been developed for use with MPEG encoded video sequences (see F. Arman, A. Hsu and M. Y. Chiu, “Image Processing on Compressed Data for Large Video Databases,” Proceedings of the 1st ACM International Conference on Multimedia, pp. 267-272, 1993). Temporal segmentation methods which work on this form of video data analyze the Discrete Cosine Transform (DCT) coefficients of the compressed data to find highly dissimilar consecutive frames which correspond to camera breaks. Again, content dependent threshold values are required to properly identify the dissimilar frames in the sequence that are associated with camera shot boundaries. Additionally, numerous applications require input directly from a video source (tape and/or camera), or from video sequences which are stored in different formats, such as QuickTime, SGImovie, and AVI. For these sequences, methods which work only on MPEG compressed video data are not suitable, as they would require encoding the video data into an MPEG format prior to SBD. Moreover, the quality of MPEG encoded data can vary greatly, thus causing the temporal segmentation from such encoded video data to be a function of the encoding as well as the content.

[0008] The fundamental drawback of the hereinabove described methods is that they do not allow for fully automatic processing based upon the content of an arbitrary input video, i.e., they are not truly domain independent. While the assumption of domain independence is valid for computation of the frame similarity metrics, it clearly does not apply to the decision criteria, particularly the selection of the threshold values. Reported studies (see D. C. Coll and G. K. Choma, “Image Activity Characteristics in Broadcast Television,” IEEE Transactions on Communication, pp. 1201-1206, October 1976) on the statistical behavior of video frame differences clearly show that a threshold value that is appropriate for one type of video content will not yield acceptable results for another type of video content.

[0009] Another drawback of the hereinabove methods is that they are fundamentally designed for the identification of individual camera shots, i.e., temporal content changes between adjacent frames. Complete content-based temporal segmentation of video sequences must also include identification of temporal segments associated with significant content changes within shots as well as the temporal segments associated with video editing effects, i.e., fade, dissolve, and uniform intensity segments. Methods have been developed to specifically detect fade (U.S. Pat. No. 5,245,436) and dissolve (U.S. Pat. No. 5,283,645) segments in video sequences, but when any of the hereinabove methods are modified in an attempt to detect the total set of possible temporal segments, their performance is compromised. Such modifications commonly require more content dependent thresholds, each of which must be established for the specific video content before optimum performance can be achieved.

[0010] Therefore, there is a need for a method and system for performing accurate and automatic content-based temporal segmentation of video sequences.

SUMMARY OF THE INVENTION

[0011] The present invention is directed to overcoming the problems set forth above. One aspect of the invention is directed to a method for performing content-based temporal segmentation of video sequences comprising the steps of: (a) transmitting the video sequence to a processor; (b) identifying within the video sequence a plurality of type-specific individual temporal segments using a plurality of type-specific detectors; (c) analyzing and refining the plurality of type-specific individual temporal segments identified in step (b); and (d) outputting a list of locations within the video sequence of the identified type-specific individual temporal segments.

[0012] It is accordingly an object of this invention to overcome the above described shortcomings and drawbacks of the known art.

[0013] It is still another object to provide a computer-implemented method and system for performing accurate automatic content-based temporal segmentation of video sequences.

[0014] Further objects and advantages of this invention will become apparent from the detailed description of a preferred embodiment which follows.

[0015] These and other aspects, objects, features, and advantages of the present invention will become more fully understood and appreciated from a review of the following description of the preferred embodiments and appended claims, and by reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] FIG. 1 is a block schematic of a computer-implemented method for content-based temporal segmentation of video sequences;

[0017] FIG. 2 is a detailed flow chart of the shot boundary detection component of the method;

[0018] FIG. 3 illustrates the individual frame color component histograms and color histogram difference for two adjacent frames of a video sequence;

[0019] FIG. 4 is a temporal plot of the frame color histogram differences that illustrates the process of elimination of false positives;

[0020] FIG. 5 is a detailed flow chart of the uniform segment detection component of the method;

[0021] FIG. 6 is a detailed flow chart of the fade segment detection component of the method;

[0022] FIG. 7 is a temporal plot of the difference in frame color histogram variance that illustrates the process of detecting fade segments which are associated with uniform segments;

[0023] FIG. 8 is a diagram illustrating the format of the list of temporal segment locations; and

[0024] FIG. 9 is a flow chart of an alternative embodiment of the invention that performs temporal segmentation of a video sequence using temporal windows.

[0025] To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.

DETAILED DESCRIPTION OF THE INVENTION

[0026] As used herein, computer readable storage medium may comprise, for example, magnetic storage media such as magnetic disk (such as floppy disk) or magnetic tape; optical storage media such as optical disc, optical tape, or machine readable bar code; solid state electronic storage devices such as random access memory (RAM), or read only memory (ROM); or any other physical device or medium employed to store a computer program or data. A processor as used herein can include one or more central processing units (CPUs).

[0027] A video sequence as used herein is defined as a temporally ordered sequence of individual digital images which may be generated directly from a digital source, such as a digital electronic camera or graphic arts application on a computer, or may be produced by the digital conversion (digitization) of the visual portion of analog signals, such as those produced by television broadcast or recorded medium, or may be produced by the digital conversion (digitization) of motion picture film. A frame as used herein is defined as the smallest temporal unit of a video sequence to be represented as a single image. A shot as used herein is defined as the temporal sequence of frames generated during a single operation of a capture device, e.g., a camera. A fade as used herein is defined as a temporal transition segment within a video sequence wherein the pixels of the video frames are subjected to a chromatic scaling operation. A fade-in is the temporal segment in which the video frame pixel values change from a spatially uniform value (nominally zero) to their normal values within the shot. Conversely, a fade-out is the temporal segment in which the video frame pixel values change from their normal values to a spatially uniform value (nominally zero). A dissolve as used herein is defined as a temporal transition segment between two adjacent camera shots wherein the frame pixels in the first shot fade out from their normal values to a zero pixel value concurrent with a fade-in of the frame pixels in the second shot from a zero pixel value to their normal frame pixel values.

[0028] As used herein, a temporal segment comprises a set of temporally consecutive frames within a video sequence that contain similar content, either a portion of a camera shot, a complete camera shot, a camera gradual transition segment (fade or dissolve), a blank content (uniform intensity) segment, or an appropriate combination of one or more of these. Temporal segmentation refers to detection of these individual temporal segments within a video sequence, or more correctly, detecting the temporal points within the video sequence where the video content transitions from one temporal segment to another. In order to detect the boundary between temporally adjacent segments, successive frame pairs in the input video sequence are processed by a computer algorithm to yield frame content comparison metrics that can be subsequently used to quantify the content similarity between successive frames.

[0029] Referring to FIG. 1, there is shown a schematic diagram of a content-based temporal segmentation method. The input video sequence 110 is processed 120 to determine the locations of the temporal segments 130 of the video sequence 110. Accurate detection of the different types of temporal segments within a video sequence requires that separate methods be employed, one for each type of temporal segment. Therefore, the process 120 of determining the locations of temporal segments 130 is achieved by the application of four type-specific temporal segment detection methods. Specifically, the method of content-based temporal segmentation 120 comprises detecting 140 camera shot boundaries (i.e., cuts), detecting 150 fade-in and fade-out segments, detecting 160 dissolve segments, and detecting 170 uniform color/gray level segments. The output from these individual detection processes is a list 145 of shot boundary locations, a list 155 of fade segment locations, a list 165 of dissolve segment locations, and a list 175 of uniform segment locations. These four lists of temporal segment locations are analyzed and refined 180 in order to resolve conflicts that may arise among the four detection processes and to consolidate the four lists into a single list 130 of temporal segment locations. Each of the type-specific temporal segment detection methods will be discussed in detail hereinbelow.

[0030] Shot Boundary Detection

[0031] Referring now to FIG. 2, the method of camera shot boundary (cut) detection 140 involves the computation of multiple frame comparison metrics in order to accurately detect the locations in the video sequence in which there is significant content change between consecutive frames, i.e., camera shot boundaries. In the preferred embodiment of the present invention, two different frame comparison metrics are computed. The first is a frame-to-frame color histogram difference metric 210 which is a measure of the color similarity of adjacent frames in the video sequence 110. This metric, as stated hereinbefore, is sensitive only to global color changes and relatively insensitive to object/camera motion. At camera shot boundaries, due to the sudden change in frame content characteristics, this metric will take on a value higher than that within a camera shot. However, different shots can have very similar color characteristics while having significantly different content, thus producing a small value in the color histogram frame difference metric at the shot boundary. Therefore, the color histogram frame difference metric 210 is supplemented with a pixel intensity frame difference metric 220, which is more sensitive to spatially localized content changes. This frame pixel difference metric 220 is a measure of the spatial similarity of adjacent frames in the video sequence 110 and will produce a large value at shot boundaries even when the color characteristics of the two shots are similar. However, this metric is also more sensitive to local spatial content variations within a shot. Therefore, the outputs of these two metrics are combined to produce a more reliable indication of the true shot boundary locations.

[0032] The color histogram frame difference metric 210 is computed as the pairwise color histogram absolute difference between two successive frame histograms: $HD = \frac{\sum\limits_{j} \left| H_{I-1}(j) - H_{I}(j) \right|}{NP}$

[0033] where

[0034] HD is the color histogram absolute difference comparison metric,

[0035] H_(I-1)(j) is the jth element of the histogram from frame I-1,

[0036] H_(I)(j) is the jth element of the histogram from frame I, and

[0037] NP is the number of pixels in the frame image.

[0038] The color histogram H_(I)(j) of each frame is computed from 24-bit YCbCr color pixel values. Color histograms for each component are computed individually and then concatenated to form a single histogram (see FIG. 3). Those skilled in the art will recognize that other color spaces, such as RGB, YIQ, L*a*b*, Lst, or HSV can be employed without departing from the scope of the invention. Additionally, multidimensional histograms or other methods for color histogram representation, as well as an intensity or luminance only histogram, may be employed for histogram computation without departing from the scope of the invention. The selected color space can also be quantized to yield fewer bins for each color component histogram.
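As an illustration only, the histogram difference HD described above can be sketched in Python. The frame layout (a flat list of (Y, Cb, Cr) tuples with 8-bit components) and the function names are assumptions made for this sketch and are not taken from the specification.

```python
def component_histogram(values, bins=256):
    """Count how often each integer value in [0, bins) occurs."""
    hist = [0] * bins
    for v in values:
        hist[v] += 1
    return hist


def frame_histogram(frame, bins=256):
    """Concatenate the Y, Cb and Cr histograms of one frame.

    `frame` is assumed to be a list of (Y, Cb, Cr) tuples, one per pixel.
    """
    hist = []
    for c in range(3):
        hist.extend(component_histogram((p[c] for p in frame), bins))
    return hist


def histogram_difference(hist_prev, hist_cur, num_pixels):
    """HD: sum of absolute bin differences divided by the pixel count NP."""
    return sum(abs(a - b) for a, b in zip(hist_prev, hist_cur)) / num_pixels


# Two tiny 4-pixel frames: in the second, two pixels have become bright.
frame_a = [(16, 128, 128)] * 4
frame_b = [(16, 128, 128)] * 2 + [(235, 128, 128)] * 2
hd = histogram_difference(frame_histogram(frame_a), frame_histogram(frame_b), 4)
```

With these toy frames only the Y histogram changes (two counts leave bin 16 and two counts enter bin 235), so the absolute differences sum to 4 and HD is 1.0; identical frames give 0.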

[0039] The pixel intensity frame difference metric 220 is computed as $PD(x,y) = \begin{cases} 1 & \text{if } \left| F_{I-1}(x,y) - F_{I}(x,y) \right| > NV \\ 0 & \text{else} \end{cases}$ Then $FPD = \frac{\sum\limits_{x} \sum\limits_{y} PD(x,y)}{NP}$

[0040] where

[0041] PD(x, y) is the pairwise pixel difference at location (x, y),

[0042] F_(I-1)(x, y) is the pixel value at location (x, y) in frame I-1,

[0043] F_(I)(x, y) is the pixel value at location (x, y) in frame I,

[0044] NV is a noise value which the absolute pixel difference at location (x, y) must exceed for PD(x, y) to equal 1,

[0045] FPD is the frame pixel difference metric, and

[0046] NP is the number of pixels in the frame image.

[0047] The frame pixel value used in F_(I)(x, y) and F_(I-1)(x, y) is computed as a weighted sum of the pixel color component values at location (x, y) in frames I and I-1 respectively. The noise value NV, used to reduce the metric's sensitivity to noise and small inconsequential content changes, is determined empirically. In the preferred embodiment, a value of 16 for NV has been determined to be adequate to provide the desired noise insensitivity for a wide variety of video content. Those skilled in the art will recognize that the pixel intensity frame difference can be computed from pixel values in various color spaces, such as YCbCr, RGB, YIQ, L*a*b*, Lst, or HSV without departing from the scope of the invention. Additionally, the selected pixel value space can be quantized to yield a reduced dynamic range, i.e., fewer pixel values for each color component.
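The frame pixel difference FPD can be sketched in the same way. The component weights below (Y only) are an assumption, since the specification states only that a weighted sum of the color components is used; the default noise value of 16 follows the preferred embodiment.

```python
def pixel_value(pixel, weights=(1.0, 0.0, 0.0)):
    """Weighted sum of the colour components; these weights are an assumption."""
    return sum(w * c for w, c in zip(weights, pixel))


def frame_pixel_difference(frame_prev, frame_cur, noise_value=16):
    """FPD: fraction of pixels whose absolute change exceeds the noise value NV."""
    changed = sum(
        1
        for p, q in zip(frame_prev, frame_cur)
        if abs(pixel_value(p) - pixel_value(q)) > noise_value
    )
    return changed / len(frame_cur)


# Four co-located pixels; the changes in Y are 0, 10, 17 and 100.
prev_frame = [(100, 128, 128)] * 4
cur_frame = [(100, 128, 128), (110, 128, 128), (117, 128, 128), (200, 128, 128)]
fpd = frame_pixel_difference(prev_frame, cur_frame)
```

Only the last two pixels change by more than 16, so FPD is 0.5 for this pair. A change of exactly 16 would not be counted, because the comparison is strict.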

[0048] The color histogram frame difference HD 210 and the pixel intensity frame difference FPD 220 are computed for every frame pair in the video sequence 110. Notice that no user adjustable threshold value is employed in the computation of either metric. Both sets of differences are passed into a k-means unsupervised clustering algorithm 230 in order to separate the difference data into two classes. This two class clustering step 230 is completely unsupervised, and does not require any user-defined or application-specific thresholds or parameters in order to achieve optimum class separation. The k-means clustering 230 is a well known technique for clustering data into statistically significant classes or groups (see R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, pp. 201-202, Wiley, New York, 1973), the details of which will not be discussed herein. Those skilled in the art will appreciate that other cluster algorithms (see A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, New Jersey, 1988) can be employed to separate the data into two classes without departing from the scope of the invention. The k-means algorithm performs two class clustering on the frame comparison metrics iteratively, until the clustering process converges to two distinct classes 240, one representing the potential shot boundary locations and the other representing the non-shot boundary locations. The set of non-shot boundary locations is normally deleted.
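A minimal two-class k-means over (HD, FPD) pairs might look like the following sketch. The specification does not describe how the cluster centers are initialized; seeding them from the points with the smallest and largest coordinate sums is an assumption of this example, as are the function and variable names.

```python
def two_class_kmeans(points, max_iter=100):
    """Cluster 2-D feature points (HD, FPD) into two classes.

    Returns one label per point; label 1 is the cluster seeded from the
    point with the largest coordinate sum (assumed initialization).
    """
    centroids = [min(points, key=sum), max(points, key=sum)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min((0, 1), key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Move each centroid to the mean of its current members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# One (HD, FPD) pair per frame pair; frame pairs 2 and 4 look like cuts.
features = [(0.05, 0.02), (0.04, 0.03), (0.9, 0.8), (0.06, 0.01), (1.1, 0.7)]
labels = two_class_kmeans(features)
potential_boundaries = [i for i, lab in enumerate(labels) if lab == 1]
```

For these toy features the high-difference cluster contains indices 2 and 4, which would form the set of potential shot boundary locations passed on to refinement.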

[0049] The set of potential shot boundary locations contains both true shot boundary locations and a number of non-shot boundary locations (false positives) due to the overlap of the two classes in feature space after clustering. Therefore, the set of potential shot boundary locations is analyzed and refined 250 using the data from the set of color histogram frame differences. Referring now to FIG. 4, this refinement is accomplished by examining the color histogram frame differences for a local maximum at each location identified 410 as a potential shot boundary in the set of potential shot boundary locations. Two cases exist for refinement of the potential shot boundary locations:

[0050] Case (i): If no other potential shot boundary exists within ±D1 frames of this location, then the frame histogram difference metric value must be greater than the metric value on either side by X1% to be a shot boundary. If so, then leave the location in the set of potential shot boundary locations. If not, then discard this location from the set of potential shot boundary locations.

[0051] Case (ii): If another potential shot boundary exists within ±D1 frames of this location, then the frame histogram difference metric value must be greater than the metric value on either side by X2% to be a shot boundary, where X2 is greater than X1. If so, then leave the location in the set of potential shot boundary locations. If not, then discard this location from the set of potential shot boundary locations.

[0052] The optimum values for the parameters D1, X1, and X2 can be determined empirically. In the preferred embodiment, the values for D1, X1, and X2 are preset to 11, 6%, and 12% respectively. These values have been shown to yield excellent performance on video sequences containing a wide variety of content.
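The two refinement cases can be sketched as follows. The reading of "greater than the metric value on either side by X%" as a multiplicative margin of (1 + X/100) over both immediate neighbors, the indexing of the difference sequence, and the treatment of the sequence ends are all assumptions of this example rather than statements from the specification.

```python
def refine_shot_boundaries(hd, candidates, d1=11, x1=6.0, x2=12.0):
    """Keep candidates whose HD value is a sufficiently strong local maximum.

    hd[i] is assumed to be the histogram difference for the frame pair ending
    at frame i. A candidate with another candidate within +/- d1 frames must
    exceed both neighbouring HD values by x2 percent, otherwise by x1 percent.
    """
    candidate_set = set(candidates)
    kept = []
    for i in sorted(candidate_set):
        has_neighbor = any(j != i and abs(j - i) <= d1 for j in candidate_set)
        margin = 1.0 + (x2 if has_neighbor else x1) / 100.0
        # Missing neighbours at the ends of the sequence are treated as zero (assumption).
        left = hd[i - 1] if i > 0 else 0.0
        right = hd[i + 1] if i + 1 < len(hd) else 0.0
        if hd[i] > margin * left and hd[i] > margin * right:
            kept.append(i)
    return kept


# Isolated candidates: a clear peak at 10 and a weak bump (5% above neighbours) at 30.
hd_a = [0.1] * 40
hd_a[10] = 1.0
hd_a[30] = 0.105
kept_a = refine_shot_boundaries(hd_a, [10, 30])

# Two adjacent candidates: the stricter X2 margin applies to both.
hd_b = [0.1] * 40
hd_b[20] = 1.0
hd_b[21] = 0.9
kept_b = refine_shot_boundaries(hd_b, [20, 21])
```

In the first sequence only location 10 survives, because the bump at 30 is less than 6% above its neighbors. In the second sequence both candidates are discarded under the 12% margin, whereas location 20 alone would have passed the 6% margin.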

[0053] The result of this refinement 250 is the elimination of false positive locations from the list of potential shot boundaries, resulting in the final list 145 of shot boundary locations within the video sequence, each identified by numerical frame number. Those skilled in the art will appreciate that other frame comparison metrics can be used either in place of or in conjunction with the color histogram and pixel difference metrics described hereinabove without departing from the scope of the invention. Functions such as difference in frame differences, absolute frame differences, chi-square test for color histogram comparison, or any other function that yields sharp discontinuities in the computed metric values across shot boundaries while maintaining a low level of activity within individual shots can be employed. Furthermore, the comparison function may be computed over the entire frame, or only within a certain predefined spatial window within the frame, or over corresponding multiple spatial segments within successive frames. Multiple functions for frame comparison can be computed for every frame pair and all features may simultaneously be utilized as elements of a feature vector representing frame similarities. These feature vectors may then be employed in the clustering algorithm described hereinabove, and the shot boundary detection threshold may be obtained in the N-dimensional feature space. Alternatively, in place of computing the frame comparison metrics from the actual video sequence frames, such comparison metrics can be derived from difference images, motion vectors, DC images, edge images, frame statistics, or the like, which themselves are derived from the individual frames of the video sequence. Prior to clustering, the calculated frame comparison metrics can be preprocessed using median filtering, mean filtering, or the like, to eliminate false discontinuities/peaks that are observed due to content activity within a shot segment. Additionally, the input video sequence can be temporally sampled, and individual frames in the video sequence may be spatially sampled to reduce the amount of data processing in order to improve algorithm speed and performance.

[0054] Uniform Segment Detection

[0055] Returning now to FIG. 1, the video sequence 110 is also analyzed to detect 170 uniform temporal segments. Such segments frequently occur in video sequences in order to add a temporal spacing, or pause, in the presentation of content. The computed frame color histogram data used in the shot boundary detection as described hereinabove is also utilized for detecting temporal segments of uniform color/intensity. Referring to FIG. 5, the mean and variance of the individual color components in the color histogram are computed 510 for each frame in the video sequence 110: $HM_{I} = \frac{1}{NP} \sum\limits_{j} j \, H_{I}(j)$

[0056] where

[0057] HM_(I) is the histogram mean value for frame I,

[0058] H_(I)(j) is the j^(th) histogram value for frame I, and

[0059] NP is the number of pixels in frame I,

[0060] and $HV_{I} = \frac{1}{NP} \sum\limits_{j} H_{I}(j) \left( j - HM_{I} \right)^{2}$

[0061] where HV_(I) is the histogram variance value for frame I.
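The histogram mean and variance can be sketched as below. The variance here weights each squared deviation by the bin count H_(I)(j), which is the standard definition of a variance computed from a histogram; treating that as the intended form is an assumption of this example.

```python
def histogram_mean(hist, num_pixels):
    """HM: mean bin index, weighted by bin counts and divided by NP."""
    return sum(j * h for j, h in enumerate(hist)) / num_pixels


def histogram_variance(hist, num_pixels):
    """HV: count-weighted mean squared deviation of the bin index from HM."""
    mean = histogram_mean(hist, num_pixels)
    return sum(h * (j - mean) ** 2 for j, h in enumerate(hist)) / num_pixels


# A 4-pixel luminance histogram with two pixels at level 10 and two at level 20.
hist = [0] * 256
hist[10] = 2
hist[20] = 2
mean = histogram_mean(hist, 4)
variance = histogram_variance(hist, 4)

# A perfectly uniform frame: all four pixels at level 16.
flat = [0] * 256
flat[16] = 4
flat_variance = histogram_variance(flat, 4)
```

The first histogram has mean 15 and variance 25, while the uniform frame has variance 0, which is the property the uniform segment detector relies on.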

[0062] If a frame has a luminance component variance less than a predetermined amount V1, then that frame is selected 520 as a uniform frame and its temporal location is appended to the list 175 of uniform segment locations. All frames in the sequence are processed 525 to initially locate the potential uniform frames. This process is followed by a refinement process 530 to group the identified frames into contiguous temporal segments. In that process 530, if a uniform frame has been previously identified D2 frames prior, then all intermediate frames are selected as uniform and their temporal locations are appended to the list 175 of uniform segment locations. Finally, if the number of temporally adjacent frames in the uniform segment is less than M1 (the minimum number of frames that can constitute a uniform temporal segment), then the temporal locations of these frames are deleted from the list 175 of uniform segment locations. The optimum values for the parameters D2, V1, and M1 can be determined empirically. In the preferred embodiment, the values of D2, V1, and M1 are preset to 3, 0.1, and 15 respectively. These values have been shown to yield excellent performance on video sequences containing a wide variety of content. The final result of this uniform segment detection process 170 is a list 175 of uniform segment locations within the video sequence 110, each identified by a start frame and end frame number.
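The selection, grouping, and minimum-length steps can be sketched as one function. Interpreting "D2 frames prior" as "within D2 frames prior" is an assumption of this example, as are the function name and the representation of the input as a list of per-frame luminance variances.

```python
def detect_uniform_segments(luma_variances, v1=0.1, d2=3, m1=15):
    """Return (start, end) frame ranges judged to be uniform segments."""
    # Step 520: flag frames whose luminance variance is below V1.
    uniform = [v < v1 for v in luma_variances]

    # Step 530: if a uniform frame occurred at most d2 frames earlier
    # (assumed reading), mark the intermediate frames as uniform too.
    last = None
    for i, flag in enumerate(list(uniform)):
        if flag:
            if last is not None and i - last <= d2:
                for k in range(last + 1, i):
                    uniform[k] = True
            last = i

    # Collect contiguous runs and drop runs shorter than M1 frames.
    segments, start = [], None
    for i, flag in enumerate(uniform + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= m1:
                segments.append((start, i - 1))
            start = None
    return segments


# Frames 10-17 and 20-27 are flat; frames 18-19 are slightly noisy.
variances = [50.0] * 10 + [0.0] * 8 + [5.0] * 2 + [0.0] * 8 + [50.0] * 10
segments = detect_uniform_segments(variances)
```

Here the two noisy frames are bridged because the previous uniform frame lies three frames back, giving a single 18-frame segment from frame 10 to frame 27. A flat run of only five frames would be discarded as shorter than M1.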

[0063] Fade Segment Detection

[0064] Referring to FIG. 1, the video sequence 110 is now analyzed 150 to detect fade-in/fade-out temporal segments. Fade segments in the video sequence 110 are temporally associated with uniform temporal segments, i.e., a fade segment will be immediately preceded or followed by a uniform segment. The beginning of each uniform temporal segment may correspond to the end of a fade-out segment. Likewise, the end of each uniform temporal segment may correspond to the beginning of a fade-in segment. Thus, it is sufficient to carry out fade temporal segment detection on the endpoints of every isolated uniform temporal segment.

[0065] Referring to FIGS. 6 and 7, fade detection begins by locating 605 each of the uniform segments in the video sequence 110 previously identified by the uniform segment detection 170. The endpoints of each uniform segment 705, i.e., the beginning 710 and end 720 frames, are temporally searched over an immediately adjacent temporal window 730 of length W. For fade-out detection 610, frame index I is set to the first frame 710 of the uniform temporal segment 705. The difference in the color histogram variance between frames I-1 and I is computed as

A_(FO) = HV_(I-1) − HV_(I)

[0066] If this difference A_(FO) is greater than zero but less than an amount ΔHV, then frame I-1 is labeled as a fade-out frame. The frame index I is decremented, and the differences in color histogram variance are observed in a similar manner for all the frames that lie inside the window 730 of size W. If at any point in the analysis the color histogram variance difference A_(FO) exceeds an amount ΔHV_(max), then the fade-out detection process 610 is terminated and fade-in detection 620 is initiated within the window 730 at the opposite end of the uniform temporal segment 705.

[0067] The interframe variance difference A_(FO) may sometimes fall below zero, due to noise in the subject frames or minute fluctuations in the luminance characteristics. In order to avoid mis-classifications due to such effects, the difference between I-2 and I is considered if the variance difference between frames I-1 and I falls below zero. If this second difference is found to be above zero, and if the variance difference B between frames I-2 and I-1 is found to satisfy the conditions 0<B<ΔHV, then frame I-1 is labeled as a fade-out frame and fade-out detection 610 proceeds as before.

[0068] For fade-in identification 620, frame index I is set to the last frame 720 of the uniform temporal segment 705. The difference in the color histogram variance between frames I+1 and I is computed as

A_(FI) = HV_(I+1) − HV_(I)

[0069] If this difference A_(FI) is greater than zero but less than an amount ΔHV, then frame I+1 is labeled as a fade-in frame. The frame index I is incremented, and the differences in color histogram variance are observed in a similar manner for all the frames that lie inside the window 730 of size W. If at any point in the analysis the color histogram variance difference A_(FI) exceeds an amount ΔHV_(max), then the fade-in detection process 620 is terminated, and the next previously identified uniform temporal segment in the video sequence is similarly analyzed. As with the detection 610 of fade-out temporal segments, the interframe variance difference A_(FI) may sometimes fall below zero, due to noise in the subject frames or minute fluctuations in the luminance characteristics. In order to avoid mis-classifications due to such effects, the difference between I+2 and I is considered if the variance difference between frames I+1 and I falls below zero. If this second difference is found to be above zero, and if the variance difference B between frames I+2 and I+1 is found to satisfy the conditions 0<B<ΔHV, then frame I+1 is labeled as a fade-in frame and fade-in detection 620 proceeds as before. This process continues until all detected uniform temporal segments have been similarly analyzed.

[0070] When all frames within the window 730 have been processed for either fade-in or fade-out, fade detection is terminated, regardless of whether the variance differences continue to satisfy the conditions previously defined. Local averaging by mean filtering may be carried out on the variances of those frames that fall inside the window 730, in order to eliminate slight local variations in the variance characteristics that may yield false detection. In another embodiment, the window constraint may be removed, and fade detection may be carried out until the stated conditions are no longer satisfied. In the preferred embodiment, the values for ΔHV, ΔHV_(max), and W are preset to

ΔHV = 0.1 × Var(I)

ΔHV_(max) = 32 × Var(I)

W = 5

[0071] where Var(I) is the computed color histogram variance of frame I. These values have been shown to yield excellent performance on video sequences containing a wide variety of content. The final result of this fade detection process 150 is a list 155 of fade segment locations within the video sequence 110, each identified by a start frame and end frame number.
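By way of illustration only, the fade-in labeling rule described above may be sketched in Python as follows. The function name `label_fade_in_frames`, its parameter names, and the choice to keep frames labeled before an early termination are assumptions made for this sketch and are not taken from the specification. The per-frame color histogram variances are assumed to be already computed and supplied as a list.

```python
def label_fade_in_frames(hv, start, window=5, dhv_factor=0.1, dhv_max_factor=32.0):
    """Label fade-in frames following frame `start` (hypothetical helper).

    hv     : list of per-frame color histogram variances HV_(I)
    start  : index I of the frame at which the analysis begins, assumed to be
             the last frame of a previously detected uniform segment
    window : window size W; dhv_factor and dhv_max_factor scale the variance
             of frame I to obtain the thresholds (0.1 and 32 in the text)
    """
    labeled = []
    end = min(start + window, len(hv) - 1)
    for i in range(start, end):
        dhv = dhv_factor * hv[i]          # threshold relative to frame I
        dhv_max = dhv_max_factor * hv[i]  # termination threshold
        a_fi = hv[i + 1] - hv[i]          # A_(FI) = HV_(I+1) - HV_(I)
        if a_fi > dhv_max:
            break                         # terminate analysis of this segment
        if 0 < a_fi < dhv:
            labeled.append(i + 1)
        elif a_fi < 0 and i + 2 < len(hv):
            # Noise tolerance: look one frame further ahead.
            second = hv[i + 2] - hv[i]
            b = hv[i + 2] - hv[i + 1]
            if second > 0 and 0 < b < dhv:
                labeled.append(i + 1)
    return labeled
```

For example, with variances `[100.0, 105.0, 103.0, 108.0, 112.0, 500.0, 510.0]` and `start=0`, the small dip at frame 2 is tolerated by the look-ahead test, so frames 1 through 4 are labeled. The jump to 500.0 is neither labeled nor large enough to trigger termination.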

[0072] Dissolve Segment Detection

[0073] Referring again to FIG. 1, the video sequence 110 is analyzed to detect 160 dissolve temporal segments. Any of the known methods for detecting dissolve temporal segments can be employed. For example, Alattar (U.S. Pat. No. 5,283,645) discloses a method for the compression of dissolve segments in digital video sequences. In that method, the dissolve segments are detected prior to compression by analyzing the temporal function of interframe pixel variance. Plotting this function reveals a concave upward parabola in the presence of a dissolve temporal segment. Detection of a dissolve temporal segment is therefore accomplished by detecting its associated parabola, which is present in the temporal function of interframe pixel variance. Those skilled in the art will recognize that other known methods of characterizing a dissolve temporal segment may be employed without departing from the scope of the invention. The final result of this dissolve detection process 160 is a list 165 of dissolve segment locations within the video sequence 110, each identified by a start frame and end frame number.
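One possible way of testing a window of the variance function for a concave upward parabola is sketched below in Python. This is not the procedure of the Alattar reference, whose details are not reproduced here. It is a plain least-squares quadratic fit, and the function names `fit_parabola` and `looks_like_dissolve`, the odd-length window, and the goodness-of-fit threshold `min_r2` are all assumptions introduced for illustration.

```python
def fit_parabola(values):
    """Least-squares fit of y = a*x^2 + b*x + c over x = -m..m.

    Centering x on the window makes the odd-power sums vanish, so the
    normal equations have a closed-form solution.
    """
    n = len(values)
    if n < 3 or n % 2 == 0:
        raise ValueError("window must have odd length >= 3")
    m = n // 2
    xs = range(-m, m + 1)
    s2 = sum(x * x for x in xs)
    s4 = sum(x ** 4 for x in xs)
    sy = sum(values)
    sxy = sum(x * y for x, y in zip(xs, values))
    sx2y = sum(x * x * y for x, y in zip(xs, values))
    a = (n * sx2y - s2 * sy) / (n * s4 - s2 * s2)
    b = sxy / s2
    c = (sy - a * s2) / n
    return a, b, c


def looks_like_dissolve(variances, min_r2=0.9):
    """True if the window resembles a concave upward parabola (hypothetical test)."""
    a, b, c = fit_parabola(variances)
    if a <= 0:
        return False                      # flat, linear, or concave downward
    m = len(variances) // 2
    if abs(-b / (2 * a)) >= m:
        return False                      # minimum lies outside the window
    mean = sum(variances) / len(variances)
    ss_tot = sum((y - mean) ** 2 for y in variances)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2
                 for x, y in zip(range(-m, m + 1), variances))
    return ss_tot > 0 and 1 - ss_res / ss_tot >= min_r2
```

In use, such a test would be applied to successive windows of the interframe pixel variance, and runs of windows that pass would be reported as candidate dissolve segments.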

[0074] Refine and Combine Locations

[0075] After detection of the four types of temporal segments, the resulting four lists of temporal segment locations are refined and combined 180 to produce a single list 130 of the locations of the individual temporal segments contained in the video sequence 110. In the refinement process 180, each detected shot boundary location is checked against the detected fade segment locations, uniform segment locations, and dissolve segment locations. If any frame that has been detected as a shot boundary has also been flagged as part of a fade, dissolve, or uniform segment, that frame is removed from the list of shot boundary locations. Additionally, adjacent shot boundaries that are closer than a predefined number of frames, i.e., the minimum number of frames required to call a temporal segment a shot, are dropped. Spurious shot boundaries that are detected as a result of sudden increases in frame luminance characteristics are eliminated by a flash detection process. Flash detection involves discarding the shot boundary locations where a sudden increase in frame luminance is registered for the duration of a single frame. Such frames exist, for example, in outdoor scenes where lightning is present. In the flash detection process, the frame statistics of the frames immediately prior to and following such a frame are observed to determine whether the frame color content remains constant. If this is the case, the sudden luminance change is labeled as a flash and does not signal the beginning of a new temporal segment. In the preferred embodiment, the mean of the frame luminance is used as the frame statistic for flash detection. After the refinement process is complete, the four lists of temporal segment locations are combined to produce a list 130 of temporal segment locations (see FIG. 8).
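The refinement of shot boundary locations described above may be sketched in Python as follows. The function names, the numeric defaults (`min_shot_len`, `jump`, `tol`), the rule of keeping the earlier of two boundaries that are too close together, and the application of the flash test before the minimum-length test are all assumptions of this sketch. The specification fixes neither the values nor the ordering. A boundary at frame b is taken to mean that a new segment begins at frame b, so a one-frame flash at frame k can produce spurious boundaries at both k and k+1.

```python
def is_flash(mean_luma, k, jump, tol):
    """True if frame k is a one-frame luminance spike between similar neighbors."""
    if k <= 0 or k >= len(mean_luma) - 1:
        return False
    before, here, after = mean_luma[k - 1], mean_luma[k], mean_luma[k + 1]
    return (here - before > jump and here - after > jump
            and abs(after - before) <= tol)


def refine_shot_boundaries(boundaries, other_segments, mean_luma,
                           min_shot_len=10, jump=30.0, tol=5.0):
    """Filter candidate shot boundaries (hypothetical helper).

    boundaries     : candidate shot boundary frame indices
    other_segments : (start, end) inclusive frame ranges of detected fade,
                     dissolve, and uniform segments
    mean_luma      : per-frame mean luminance, the flash-detection statistic
    """
    kept = []
    for b in sorted(boundaries):
        # 1. Drop boundaries inside a fade, dissolve, or uniform segment.
        if any(s <= b <= e for s, e in other_segments):
            continue
        # 2. Drop boundaries caused by entering or leaving a flash frame.
        if is_flash(mean_luma, b, jump, tol) or is_flash(mean_luma, b - 1, jump, tol):
            continue
        # 3. Drop boundaries too close to the previously kept boundary.
        if kept and b - kept[-1] < min_shot_len:
            continue
        kept.append(b)
    return kept
```

Applying the flash test before the minimum-length test prevents a spurious flash boundary from suppressing a genuine boundary that follows it closely.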

[0076] In the preferred embodiment described hereinabove, the frame color histogram difference and frame pixel difference metrics are computed for the entire video sequence 110 prior to clustering in order to produce the list of potential shot boundary locations. This is an acceptable approach for video sequences that can be processed off-line. For video sequences which require more immediate results or for video sequences of long duration, an alternative embodiment of the invention computes these frame difference metrics from frames within smaller temporal regions (windows) to provide a “semi-on-the-fly” implementation. The length of the temporal window can be a predetermined amount, measured in frames or seconds. The only requirement is that within the temporal window there exists at least one true camera shot boundary for the clustering process to work properly. Alternatively, the temporal window length can be computed so as to ensure that there exists at least one true shot boundary within the window. In this embodiment, the variance of the color histogram difference is computed at every frame as it is processed. The running mean and variance of this metric are computed sequentially as the frames of the video sequence are processed. At each significant shot boundary in the video sequence, the running variance value will show a local maximum value due to the significant change in the color histogram difference metric at this temporal location. When the number of local maxima is greater than LM, the temporal window length for the first window is set to encompass all frames up to that point and the data for the two difference metrics (color histogram difference and frame pixel difference) are passed into the clustering process as described hereinbefore. The running mean and variance values are reset and the process continues from that point to determine the length of the next temporal window. This process continues until the entire video sequence is processed. In this manner, the video sequence is parsed into smaller sequences so that the clustering and refinement results (shot boundary locations) are available for each smaller sequence prior to the completion of the processing for the full video sequence. The value of LM can be determined empirically. In the preferred embodiment, the value of LM is preset to 5. This value ensures that the class of shot boundaries will be sufficiently populated for the hereinabove described clustering process and has been shown to yield excellent performance on video sequences containing a wide variety of content.
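The adaptive choice of temporal window length may be sketched in Python as follows. The function name `split_into_windows` is hypothetical, and the running variance is maintained here with Welford's sequential update, which is one common choice and is not named in the specification. As a simplification, every local maximum of the running variance is counted. A practical implementation would likely need an additional criterion to count only significant maxima.

```python
def split_into_windows(metric, lm=5):
    """Split a per-frame difference metric into temporal windows.

    metric : per-frame color histogram difference values
    lm     : a window is closed once the running variance has shown more
             than `lm` local maxima (LM in the text)
    Returns a list of (start, end) frame ranges, end exclusive.
    """
    windows = []
    start = 0
    n, mean, m2 = 0, 0.0, 0.0            # Welford accumulators
    prev_var, rising, maxima = None, False, 0
    for t, x in enumerate(metric):
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        var = m2 / n                      # running (population) variance
        if prev_var is not None:
            if var > prev_var:
                rising = True
            elif var < prev_var:
                if rising:
                    maxima += 1           # previous frame was a local maximum
                rising = False
        prev_var = var
        if maxima > lm:
            windows.append((start, t + 1))
            start = t + 1
            n, mean, m2 = 0, 0.0, 0.0     # reset running statistics
            prev_var, rising, maxima = None, False, 0
    if start < len(metric):
        windows.append((start, len(metric)))  # trailing partial window
    return windows
```

Each returned range identifies the frames whose two difference metrics would be passed together to the clustering process.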

[0077] In summary, the hereinabove-described method and system perform accurate and automatic content-based temporal segmentation of video sequences without the use of content-specific thresholds.

[0078] The invention has been described with reference to a preferred embodiment. However, it will be appreciated that variations and modifications can be effected by a person of ordinary skill in the art without departing from the scope of the invention.

What is claimed is:
 1. A method for performing content-based temporal segmentation of video sequences comprising the steps of (a) transmitting the video sequence to a processor; (b) identifying within the video sequence a plurality of type-specific individual temporal segments using a plurality of type-specific detectors; (c) detecting the content of the plurality of type-specific individual temporal segments, and refining the plurality of type-specific individual temporal segments identified in step (b), including eliminating spurious shot boundaries; and (d) outputting a list of locations within the video sequence of the identified type-specific individual temporal segments.
 2. The method of claim 1, wherein step (b) includes the step of identifying individually or any combination of camera shot temporal segments, uniform intensity temporal segments, fade-in and fade-out temporal segments, or dissolve temporal segments.
 3. The method of claim 2, wherein the step of identifying fade-in and fade-out temporal segments includes analyzing temporal frame color component histogram variance.
 4. A method for performing content-based temporal segmentation of video sequences comprising the steps of: (a) transmitting the video sequence to a processor; (b) identifying within the video sequence fade in and fade out temporal segments by analyzing temporal frame color component histogram variance; and (c) outputting a list of locations within the video sequence of the fade in and fade out temporal segments.
 5. The method as in claim 4, wherein step (b) includes identifying the uniform temporal segments within the video sequence prior to identifying the fade in and fade out temporal segments.
 6. A computer program product, comprising: a computer readable storage medium having a computer program stored thereon for performing the steps of: (a) transmitting the video sequence to a processor; (b) identifying within the video sequence a plurality of type-specific individual temporal segments using a plurality of type-specific detectors; (c) detecting the content of the plurality of type-specific individual temporal segments, and refining the plurality of type-specific individual temporal segments identified in step (b), including eliminating spurious shot boundaries; and (d) outputting a list of locations within the video sequence of the identified type-specific individual temporal segments.
 7. The computer program product of claim 6, wherein step (b) includes the step of identifying individually or any combination of camera shot temporal segments, uniform intensity temporal segments, fade-in and fade-out temporal segments, or dissolve temporal segments.
 8. The computer program product of claim 7, wherein the step of identifying fade-in and fade-out temporal segments includes analyzing temporal frame color component histogram variance.
 9. A computer program product, comprising: a computer readable storage medium having a computer program stored thereon for performing the steps of: (a) transmitting the video sequence to a processor; (b) identifying within the video sequence fade in and fade out temporal segments by analyzing temporal frame color component histogram variance; and (c) outputting a list of locations within the video sequence of the fade in and fade out temporal segments. 